Introduction

COMETS Analytics supports all cohort-specific analyses of the COMETS consortium. This collaborative work is done via the COMETS harmonization group activities. For more information, see the [COMETS website](http://epi.grants.cancer.gov/comets/).

Data Input Format

The required input file should be in Excel format with the following 6 sheets:

  1. Metabolites - from harmonized metabolites output
  2. SubjectMetabolites - abundances in columns and subject in rows
  3. SubjectData - other exposure and adjustment variables
  4. VarMap - maps the variables needed to conduct the cohort-specific analysis. Specify the names of the cohort's variables under the CohortVariable column. If a variable has the same name in the cohort as in VarReference, leave the cell blank.
  5. Models - models used to conduct the COMETS analysis. Outcomes, exposures and adjustments can each specify multiple covariates delimited by spaces (e.g., age bmi).
  6. ModelOptions - optional sheet containing model specific options

An example input file is available HERE.
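
Before loading a workbook, it can help to confirm that the expected sheets are present. The following is a minimal sketch in base R, not part of the COMETS package: check_comets_sheets() is a hypothetical helper, and the sheet names are copied from the list above, with ModelOptions treated as optional as that list describes.

```r
# Sheet names taken from the list above; ModelOptions is described as optional.
required_sheets <- c("Metabolites", "SubjectMetabolites", "SubjectData",
                     "VarMap", "Models")
optional_sheets <- "ModelOptions"

# Hypothetical helper (not a COMETS function): compares the sheet names
# found in a workbook against the expected ones.
check_comets_sheets <- function(sheet_names) {
  missing <- setdiff(required_sheets, sheet_names)
  unexpected <- setdiff(sheet_names, c(required_sheets, optional_sheets))
  list(ok = length(missing) == 0, missing = missing, unexpected = unexpected)
}

# Example with a workbook that lacks the SubjectMetabolites sheet
res <- check_comets_sheets(c("Metabolites", "SubjectData", "VarMap", "Models"))
print(res$missing)
```

With a real file, the sheet names could be obtained with readxl::excel_sheets(path) and passed to the helper.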

Analysis Workflow for correlation analysis

1. Load Data

The first step of the analysis is to load in the data with the readCOMETSinput() function. Input for this function is an Excel spreadsheet, per the description above.

# Retrieve the full path of the input data
dir <- system.file("extdata", package="COMETS", mustWork=TRUE)
csvfile <- file.path(dir, "cometsInputAge.xlsx")

# Read in and process the input data
exmetabdata <- COMETS::readCOMETSinput(csvfile)
## [1] "Metabolites sheet is read in"
## [1] "SubjectMetabolites sheet is read in"
## [1] "SubjectData sheet is read in"
## [1] "VarMap sheet is read in"
## [1] "Models sheet is read in"
## [1] "ModelOptions sheet is read in"
## [1] "There are 14 categorical variables"
## [1] "Running Integrity Check..."
## Joining, by = "id"
## Joining, by = "hmdb_id"
## [1] "Input data has passed QC (metabolite and sample names match in all input files)"

To plot the distribution of variances for each metabolite:

COMETS::plotVar(exmetabdata,titlesize=12)
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.
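
As a rough illustration of what a per-metabolite variance distribution looks like, here is a base R sketch on simulated data. It does not reproduce plotVar(): the abundance matrix, the metabolite names, and the choice to take variances on the log scale are all assumptions made for the example.

```r
set.seed(1)

# Simulated abundances: 50 subjects (rows) by 8 hypothetical metabolites (columns)
abund <- matrix(rlnorm(50 * 8), nrow = 50,
                dimnames = list(NULL, paste0("metab", 1:8)))

# Variance of each metabolite across subjects (log scale is an assumption)
metab_var <- apply(log(abund), 2, var)

# Bin the variances; plot(h) would draw the histogram
h <- hist(metab_var, plot = FALSE)
print(round(metab_var, 3))
```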

To plot the distribution of minimum/missing values:

COMETS::plotMinvalues(exmetabdata,titlesize=12)
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.

2. Get Model Data

There are two ways to specify a model: batch or interactive. In batch mode, models are specified in the Models sheet of your input file. The model information needs to be read in with the function getModelData() and processed so the software knows which models to run. The following call selects the “1 Gender adjusted” model from the Models sheet of the input file.

exmodeldata <- COMETS::getModelData(exmetabdata,modlabel="1 Gender adjusted")

In interactive mode, models are specified as parameters to getModelData(), which again reads in and processes the model information so the software knows which models to run.
The following call defines a model with age and bmi_grp as the exposure variables, and includes only the subjects with age > 40 and bmi_grp > 2.

exmodeldata2 <- COMETS::getModelData(exmetabdata, modelspec="Interactive",
    exposures=c("age","bmi_grp"), where=c("age>40","bmi_grp>2"))
## [1] "Filtering subjects according to the rule(s) age>40 & bmi_grp>2 . 279 of 1000 are retained."
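
The log message above shows the two where rules joined with &, so a subject is retained only if every rule holds. The following base R sketch mimics that behaviour on a small made-up data frame. It is an illustration only: the subject values are invented and this is not how COMETS implements the filter internally.

```r
# Made-up subject data for illustration
subjects <- data.frame(
  id      = 1:6,
  age     = c(35, 42, 55, 61, 39, 47),
  bmi_grp = c(3, 1, 3, 4, 4, 2)
)

# Rules written the same way as in the where argument above
rules <- c("age>40", "bmi_grp>2")

# Evaluate each rule against the data and combine them with a logical AND
keep <- Reduce(`&`, lapply(rules, function(r) eval(parse(text = r), envir = subjects)))
filtered <- subjects[keep, ]

message(sum(keep), " of ", nrow(subjects), " are retained.")
print(filtered)
```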

3. Run Simple Unstratified Correlation Analysis

The runModel() function is the main function for running a single model, and by default, a correlation analysis is performed.

excorrdata  <- COMETS::runModel(exmodeldata2,exmetabdata,"DPP")

The output of the correlation analysis can then be compiled and output to an Excel file with the following function:

COMETS::OutputListToExcel(filename="DPP_corr.xlsx", excorrdata)

To view the first 3 lines of the correlation analysis output, simply type:

COMETS::showModel(excorrdata,nlines=3)
## 
## ModelSummary:
##   run cohort        spec model                   outcomespec exposurespec nobs
## 1   1    DPP Interactive       _1_2_3_benzenetriol_sulfate_2          age  279
## 2   2    DPP Interactive            _1_2_dipalmitoylglycerol          age  279
## 3   3    DPP Interactive                    _1_2_propanediol          age  279
##   message adjvars adjvars.removed adjspec   outcome_uid
## 1                                         CHEM100006374
## 2                                             HMDB07098
## 3                                             HMDB01881
##                          outcome exposure_uid adj_uid
## 1 1,2,3-benzenetriol sulfate (2)          age        
## 2              DG(16:0/16:0/0:0)          age        
## 3               Propylene glycol          age        
## 
## Effects:
##   run                   outcomespec exposurespec term        corr     p.value
## 1   1 _1_2_3_benzenetriol_sulfate_2          age  age 0.164624501 0.005846722
## 2   2      _1_2_dipalmitoylglycerol          age  age 0.068903188 0.251337451
## 3   3              _1_2_propanediol          age  age 0.001667259 0.977882521
## NULL
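
To make the corr and p.value columns concrete, the sketch below computes an unadjusted correlation between an exposure and a single metabolite with base R's cor.test(). The data are simulated, and the use of Pearson correlation is an assumption: this document does not state which correlation measure runModel() uses.

```r
set.seed(42)
n <- 200

# Simulated exposure and a hypothetical metabolite abundance
age   <- rnorm(n, mean = 50, sd = 10)
metab <- 0.02 * age + rnorm(n)

# Correlation estimate and its p-value (Pearson chosen for illustration)
ct <- cor.test(age, metab, method = "pearson")

result <- data.frame(term = "age",
                     corr = unname(ct$estimate),
                     p.value = ct$p.value)
print(result)
```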

To display the heatmap of the resulting correlation matrix, use the showHeatmap() function.

COMETS::showHeatmap(excorrdata)

To display the hierarchical clustering of the resulting correlation matrix, use the showHClust() function. This display requires at least 2 rows and 2 columns in the correlation matrix.

exmodeldata<-COMETS::getModelData(exmetabdata,modelspec = "Interactive",exposures = c("bmi_grp","age"))
excorrdata  <- COMETS::runModel(exmodeldata,exmetabdata,"DPP")
COMETS::showHClust(excorrdata, showticklabels=FALSE)
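
The 2-by-2 requirement makes sense given how hierarchical clustering works: there must be at least two objects to merge along each dimension. The base R sketch below clusters the rows and columns of a made-up correlation matrix. The distance measure (Euclidean) and linkage (complete) are assumptions for illustration, since the settings used by showHClust() are not described here.

```r
set.seed(7)

# Hypothetical correlation matrix: 5 metabolites (rows) by 3 exposures (columns)
corr_mat <- matrix(runif(15, -1, 1), nrow = 5,
                   dimnames = list(paste0("metab", 1:5),
                                   c("age", "bmi_grp", "race_grp")))

# Cluster rows and columns by the distance between their correlation profiles
row_clust <- hclust(dist(corr_mat), method = "complete")
col_clust <- hclust(dist(t(corr_mat)), method = "complete")
print(row_clust$order)

# With a single row there is nothing to cluster, so hclust() raises an error
one_row <- tryCatch(hclust(dist(corr_mat[1, , drop = FALSE])),
                    error = function(e) conditionMessage(e))
print(one_row)
```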

Results can be written to an output Excel file with the following command:

COMETS::OutputListToExcel("Model1.xlsx", excorrdata)

4. Run Stratified Correlation Analysis

A stratified correlation analysis can be performed by specifying stratification variables in the call to getModelData(). If more than one stratification variable is specified, the strata are defined by all unique combinations of the stratification variables. The following call defines a model stratified by race_grp.

  exmodeldata2 <- COMETS::getModelData(exmetabdata,modelspec="Interactive",
                   outcomes=c("lactose","lactate"),
                exposures=c("age","bmi_grp"),strvars="race_grp")

The stratified correlation analysis is run by calling the runModel() function.

  excorrdata2  <- COMETS::runModel(exmodeldata2,exmetabdata,"DPP")
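
The idea of strata defined by all unique combinations of the stratification variables can be sketched in base R as follows. The data are simulated, the sex variable is hypothetical (it is not taken from the example input file), and the per-stratum Pearson correlation is an illustration rather than the COMETS implementation.

```r
set.seed(3)
n <- 300

# Simulated data with two stratification variables
dat <- data.frame(
  age      = rnorm(n, 50, 10),
  race_grp = factor(rep(1:3, length.out = n)),
  sex      = factor(rep(c("F", "M"), each = n / 2)),  # hypothetical variable
  lactose  = rnorm(n)
)

# One stratum per observed combination of race_grp and sex
strata <- interaction(dat$race_grp, dat$sex, drop = TRUE)
pieces <- split(dat, strata)

# Run the same correlation separately within each stratum
strat_corr <- do.call(rbind, lapply(names(pieces), function(s) {
  d  <- pieces[[s]]
  ct <- cor.test(d$age, d$lactose)
  data.frame(stratum = s, nobs = nrow(d),
             corr = unname(ct$estimate), p.value = ct$p.value)
}))
print(strat_corr)
```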

5. Linear regression with lm

Call getModelData() to define a model which adjusts for age group, has lactose and lactate as outcome variables, and has age and bmi group as the exposure variables.

  exmodeldata <- COMETS::getModelData(exmetabdata,modelspec="Interactive", adjvars="age_grp",
                   outcomes=c("lactose","lactate"), exposures=c("age","bmi_grp"))

To run a linear regression using the lm function, a list of options must be passed into runModel() with the model option set to “lm”.

  lm_results  <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=list(model="lm"))
  print(lm_results)
## $ModelSummary
##   run cohort        spec model outcomespec exposurespec wald.pvalue  r.squared
## 1   1    DPP Interactive           lactose          age  0.16451267 0.01467966
## 2   2    DPP Interactive           lactate          age  0.60149903 0.00165591
## 3   3    DPP Interactive           lactose      bmi_grp  0.02387982 0.02206445
## 4   4    DPP Interactive           lactate      bmi_grp  0.00166282 0.01641147
##   adj.r.squared     sigma     loglik       aic       bic   deviance df.residual
## 1    0.01171183 1.8029707 -2006.3702 4022.7404 4047.2792 3237.70036         996
## 2   -0.00135115 0.2882131  -172.8794  355.7588  380.2975   82.73452         996
## 3    0.01714526 1.7980076 -2002.6087 4019.2173 4053.5716 3213.43441         994
## 4    0.01146384 0.2863629  -165.4342  344.8684  379.2227   81.51171         994
##   nobs message             adjvars adjvars.removed adjspec outcome_uid
## 1 1000         age_grp.2;age_grp.3                 age_grp   HMDB00186
## 2 1000         age_grp.2;age_grp.3                 age_grp   HMDB00190
## 3 1000         age_grp.2;age_grp.3                 age_grp   HMDB00186
## 4 1000         age_grp.2;age_grp.3                 age_grp   HMDB00190
##         outcome exposure_uid             adj_uid
## 1 Alpha-Lactose          age age_grp.2;age_grp.3
## 2 L-Lactic acid          age age_grp.2;age_grp.3
## 3 Alpha-Lactose      bmi_grp age_grp.2;age_grp.3
## 4 L-Lactic acid      bmi_grp age_grp.2;age_grp.3
## 
## $Effects
##   run outcomespec exposurespec      term     estimate   std.error  statistic
## 1   1     lactose          age       age  0.034697534 0.024961296  1.3900534
## 2   2     lactate          age       age -0.002083854 0.003990177 -0.5222460
## 3   3     lactose      bmi_grp bmi_grp.2 -0.047905160 0.134163189 -0.3570663
## 4   3     lactose      bmi_grp bmi_grp.3  0.360610471 0.145954081  2.4707118
## 5   3     lactose      bmi_grp bmi_grp.4  0.420224133 0.577141138  0.7281133
## 6   4     lactate      bmi_grp bmi_grp.2  0.034207637 0.021367743  1.6009008
## 7   4     lactate      bmi_grp bmi_grp.3  0.080743228 0.023245640  3.4734783
## 8   4     lactate      bmi_grp bmi_grp.4 -0.126241151 0.091919426 -1.3733892
##        p.value
## 1 0.1648232433
## 2 0.6016151626
## 3 0.7211179261
## 4 0.0136512756
## 5 0.4667157218
## 6 0.1097166272
## 7 0.0005359045
## 8 0.1699410577
## 
## attr(,"ptime")
## [1] "Processing time: 0.16 sec"
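
In the output above, the categorical exposure bmi_grp contributes three terms (bmi_grp.2 to bmi_grp.4) and the residual degrees of freedom drop from 996 to 994, which is consistent with 1000 observations minus six coefficients (intercept, three bmi_grp terms, two age_grp terms). The base R sketch below shows the same expansion on simulated data. Note that plain lm() names the terms bmi_grp2, bmi_grp3, and so on, without the dot used in the COMETS output.

```r
set.seed(11)
n <- 400

# Simulated data: a 4-level exposure and a 3-level adjustment variable
dat <- data.frame(
  bmi_grp = factor(rep(1:4, length.out = n)),
  age_grp = factor(rep(1:3, length.out = n))
)
dat$lactose <- 0.3 * (dat$bmi_grp == 3) + rnorm(n)

# The factor exposure expands into one indicator term per non-reference level
fit   <- lm(lactose ~ bmi_grp + age_grp, data = dat)
coefs <- summary(fit)$coefficients

exposure_rows <- grep("^bmi_grp", rownames(coefs))
print(coefs[exposure_rows, ])
print(df.residual(fit))
```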

6. Linear regression with glm

Run a linear regression using the glm function for the same variables as above. The default family used with glm is “gaussian”, which corresponds to a linear regression. The Effects data frame will be the same as with lm, but the ModelSummary data frame will contain some different columns.

  glm_results  <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=list(model="glm"))
  print(all.equal(lm_results$Effects, glm_results$Effects))
## [1] TRUE
  print(glm_results$ModelSummary)
##   run cohort        spec model outcomespec exposurespec converged wald.pvalue
## 1   1    DPP Interactive           lactose          age         1  0.16451267
## 2   2    DPP Interactive           lactate          age         1  0.60149903
## 3   3    DPP Interactive           lactose      bmi_grp         1  0.02387982
## 4   4    DPP Interactive           lactate      bmi_grp         1  0.00166282
##   null.deviance df.null     loglik       aic       bic   deviance df.residual
## 1    3285.93681     999 -2006.3702 4022.7404 4047.2792 3237.70036         996
## 2      82.87175     999  -172.8794  355.7588  380.2975   82.73452         996
## 3    3285.93681     999 -2002.6087 4019.2173 4053.5716 3213.43441         994
## 4      82.87175     999  -165.4342  344.8684  379.2227   81.51171         994
##   nobs message             adjvars adjvars.removed adjspec outcome_uid
## 1 1000         age_grp.2;age_grp.3                 age_grp   HMDB00186
## 2 1000         age_grp.2;age_grp.3                 age_grp   HMDB00190
## 3 1000         age_grp.2;age_grp.3                 age_grp   HMDB00186
## 4 1000         age_grp.2;age_grp.3                 age_grp   HMDB00190
##         outcome exposure_uid             adj_uid
## 1 Alpha-Lactose          age age_grp.2;age_grp.3
## 2 L-Lactic acid          age age_grp.2;age_grp.3
## 3 Alpha-Lactose      bmi_grp age_grp.2;age_grp.3
## 4 L-Lactic acid      bmi_grp age_grp.2;age_grp.3
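
The equivalence between lm and a Gaussian glm can be checked directly in base R. The sketch below fits both to the same simulated data and compares the coefficients, and also confirms that the Gaussian deviance equals the residual sum of squares, which is why the deviance column is identical in the two ModelSummary tables above. The variable names and data are made up for the example.

```r
set.seed(5)
n <- 250

# Simulated data with a continuous exposure and a categorical adjustment
dat <- data.frame(
  age     = rnorm(n, 50, 10),
  age_grp = factor(rep(1:3, length.out = n))
)
dat$lactate <- 1 + 0.01 * dat$age + rnorm(n, sd = 0.3)

fit_lm  <- lm(lactate ~ age + age_grp, data = dat)
fit_glm <- glm(lactate ~ age + age_grp, data = dat, family = gaussian())

# Same coefficients from both fits
print(all.equal(coef(fit_lm), coef(fit_glm)))

# Gaussian deviance is the residual sum of squares of the linear model
print(all.equal(deviance(fit_glm), sum(residuals(fit_lm)^2)))
```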

7. Logistic regression with glm

Call getModelData() to define a model which adjusts for age group, has nested_case as the outcome variable, and has lactose and lactate as the exposure variables. The variable nested_case must be a binary 0-1 variable.

  exmodeldata <- COMETS::getModelData(exmetabdata,modelspec="Interactive", adjvars="age_grp",
                   outcomes="nested_case", exposures=c("lactose","lactate"))

To run a logistic regression, the list of options op must also include a model.options list with family set to “binomial”.

  op <- list(model="glm", model.options=list(family="binomial"))
  glm_results  <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=op)
  print(glm_results)
## $ModelSummary
##   run cohort        spec model outcomespec exposurespec converged wald.pvalue
## 1   1    DPP Interactive       nested_case      lactose         1  0.51856182
## 2   2    DPP Interactive       nested_case      lactate         1  0.01003264
##   null.deviance df.null    loglik      aic      bic deviance df.residual nobs
## 1      1386.278     999 -692.5559 1393.112 1412.743 1385.112         996 1000
## 2      1386.278     999 -689.4160 1386.832 1406.463 1378.832         996 1000
##   message             adjvars adjvars.removed adjspec outcome_uid     outcome
## 1         age_grp.2;age_grp.3                 age_grp nested_case nested_case
## 2         age_grp.2;age_grp.3                 age_grp nested_case nested_case
##   exposure_uid             adj_uid
## 1    HMDB00186 age_grp.2;age_grp.3
## 2    HMDB00190 age_grp.2;age_grp.3
## 
## $Effects
##   run outcomespec exposurespec    term   estimate  std.error statistic
## 1   1 nested_case      lactose lactose 0.02268456 0.03513914 0.6455639
## 2   2 nested_case      lactate lactate 0.57214513 0.22221799 2.5747022
##      p.value
## 1 0.51856182
## 2 0.01003264
## 
## attr(,"ptime")
## [1] "Processing time: 0.09 sec"
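
Logistic regression estimates are usually read as odds ratios. The sketch below exponentiates the estimates printed in the Effects table above and adds approximate 95% Wald intervals. It assumes the estimates are on the log-odds scale, which holds for the default logit link of the binomial family in glm; whether COMETS reports odds ratios itself is not covered here.

```r
# Estimates and standard errors copied from the Effects table above
effects <- data.frame(
  term      = c("lactose", "lactate"),
  estimate  = c(0.02268456, 0.57214513),
  std.error = c(0.03513914, 0.22221799)
)

# Odds ratio per unit increase in the exposure, with a 95% Wald interval
z <- qnorm(0.975)
effects$odds.ratio <- exp(effects$estimate)
effects$ci.lower   <- exp(effects$estimate - z * effects$std.error)
effects$ci.upper   <- exp(effects$estimate + z * effects$std.error)
print(effects)
```

An interval that excludes 1 corresponds to a Wald p-value below 0.05, which matches the lactate row in the table above.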

8. Poisson regression with glm

Call getModelData() to define a model which adjusts for age group, has n_visits as the outcome variable, and has lactose and lactate as the exposure variables. The variable n_visits must be a non-negative, integer-valued variable.

  exmodeldata <- COMETS::getModelData(exmetabdata,modelspec="Interactive", adjvars="age_grp",
                   outcomes="n_visits", exposures=c("lactose","lactate"))

To run a Poisson regression, the list of options op must also include a model.options list with family set to “poisson”.

  op <- list(model="glm", model.options=list(family="poisson"))
  poisson_results  <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=op)
  print(poisson_results)
## $ModelSummary
##   run cohort        spec model outcomespec exposurespec converged wald.pvalue
## 1   1    DPP Interactive          n_visits      lactose         1   0.6105400
## 2   2    DPP Interactive          n_visits      lactate         1   0.8271571
##   null.deviance df.null    loglik      aic      bic deviance df.residual nobs
## 1       632.649     999 -1538.073 3084.146 3103.777 630.4096         996 1000
## 2       632.649     999 -1538.179 3084.357 3103.988 630.6207         996 1000
##   message             adjvars adjvars.removed adjspec outcome_uid  outcome
## 1         age_grp.2;age_grp.3                 age_grp    n_visits n_visits
## 2         age_grp.2;age_grp.3                 age_grp    n_visits n_visits
##   exposure_uid             adj_uid
## 1    HMDB00186 age_grp.2;age_grp.3
## 2    HMDB00190 age_grp.2;age_grp.3
## 
## $Effects
##   run outcomespec exposurespec    term   estimate  std.error statistic
## 1   1    n_visits      lactose lactose 0.00624383 0.01225956 0.5093028
## 2   2    n_visits      lactate lactate 0.01679890 0.07693599 0.2183491
##     p.value
## 1 0.6105400
## 2 0.8271571
## 
## attr(,"ptime")
## [1] "Processing time: 0.1 sec"
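
For comparison, the same kind of model can be fitted directly with base R's glm(). The sketch below simulates a count outcome, checks that it is a non-negative integer variable, and fits a Poisson regression. The is_count() helper and the simulated data are hypothetical, and the exponentiated coefficient is read as a rate ratio under the default log link.

```r
set.seed(9)
n <- 500

# Simulated exposure, adjustment variable, and count outcome
dat <- data.frame(
  lactate = rnorm(n),
  age_grp = factor(rep(1:3, length.out = n))
)
dat$n_visits <- rpois(n, lambda = exp(0.2 + 0.1 * dat$lactate))

# Hypothetical helper: TRUE if x holds only non-negative whole numbers
is_count <- function(x) all(!is.na(x) & x >= 0 & x == round(x))
stopifnot(is_count(dat$n_visits))

fit <- glm(n_visits ~ lactate + age_grp, data = dat, family = poisson())

# Coefficient for the exposure and its rate ratio
est <- summary(fit)$coefficients["lactate", ]
rate_ratio <- exp(est[["Estimate"]])
print(est)
print(rate_ratio)
```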

9. Run Analysis on all models defined in the input Excel sheet (‘super-batch’ mode)

All models designated in the input file can be run with one command, and individual output Excel files or correlation results are written to the current directory by default. The function returns a list of objects.

 allresults <- COMETS::runAllModels(exmetabdata,writeTofile=FALSE)

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                          
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.1        httr_1.4.2           tidyr_1.1.2         
##  [4] jsonlite_1.7.2       viridisLite_0.3.0    splines_4.0.2       
##  [7] foreach_1.5.1        tmvnsim_1.0-2        prodlim_2019.11.13  
## [10] assertthat_0.2.1     stats4_4.0.2         cellranger_1.1.0    
## [13] yaml_2.2.1           ipred_0.9-9          pillar_1.4.7        
## [16] backports_1.2.0      lattice_0.20-41      glue_1.4.2          
## [19] pROC_1.16.2          digest_0.6.27        RColorBrewer_1.1-2  
## [22] colorspace_2.0-0     recipes_0.1.15       htmltools_0.5.0     
## [25] Matrix_1.2-18        plyr_1.8.6           psych_2.0.12        
## [28] timeDate_3043.102    pkgconfig_2.0.3      ISwR_2.0-8          
## [31] broom_0.7.3          caret_6.0-86         corpcor_1.6.9       
## [34] purrr_0.3.4          scales_1.1.1         webshot_0.5.2       
## [37] gower_0.2.2          lava_1.6.8.1         tibble_3.0.4        
## [40] farver_2.0.3         generics_0.1.0       ggplot2_3.3.3       
## [43] ellipsis_0.3.1       withr_2.3.0          nnet_7.3-14         
## [46] lazyeval_0.2.2       mnormt_2.0.2         survival_3.1-12     
## [49] magrittr_2.0.1       crayon_1.3.4         readxl_1.3.1        
## [52] heatmaply_1.1.1      evaluate_0.14        nlme_3.1-148        
## [55] MASS_7.3-51.6        class_7.3-17         tools_4.0.2         
## [58] registry_0.5-1       data.table_1.13.6    lifecycle_0.2.0     
## [61] stringr_1.4.0        plotly_4.9.3         munsell_0.5.0       
## [64] compiler_4.0.2       rlang_0.4.10         grid_4.0.2          
## [67] iterators_1.0.13     htmlwidgets_1.5.3    crosstalk_1.1.0.1   
## [70] labeling_0.4.2       rmarkdown_2.6        subselect_0.15.2    
## [73] gtable_0.3.0         ModelMetrics_1.2.2.2 codetools_0.2-16    
## [76] TSP_1.1-10           reshape2_1.4.4       R6_2.5.0            
## [79] seriation_1.2-9      gridExtra_2.3        lubridate_1.7.9.2   
## [82] knitr_1.30           dplyr_1.0.2          COMETS_1.5.0.0      
## [85] dendextend_1.14.0    stringi_1.5.3        parallel_4.0.2      
## [88] Rcpp_1.0.5           vctrs_0.3.6          rpart_4.1-15        
## [91] tidyselect_1.1.0     xfun_0.20